$f(x_1, x_2) = (x_1 + x_2)^2$
Assume that $x_1, x_2 \sim U[-1,1]$ and $x_1 = x_2$ (full dependency).
Since $E[U[-1,1]] = 0$ and $E[U[-1,1]^2] = \frac{1}{3}$:
$g^{PD}_{x_1}(z) = E_{X_{-x_1}}[f(X_{x_1|=z})] = E_{X_{-x_1}}[z^2 + 2zX_2 + X_2^2] = z^2 + 2z \cdot 0 + \frac{1}{3} = z^2 + \frac{1}{3}$
Keep in mind that because $x_1 = x_2$ (full dependency), sampling $x_2$ given $x_1 = z$ always yields $x_2 = z$.
$g^{MP}_{x_1}(z) = E_{X_{-x_1}|x_1=z}[f(X_{x_1|=z})] = E_{X_{-x_1}|x_1=z}[z^2 + 2zX_2 + X_2^2] = z^2 + 2z^2 + z^2 = 4z^2$
$g^{AL}_{x_1}(z) = \int^{z}_{-1} E_{X_{-x_1}|x_1=v}\left[\frac{\partial f(x)}{\partial x_1}\right] dv = \int^{z}_{-1} E_{X_{-x_1}|x_1=v}[2x_1 + 2x_2]\, dv = \int^{z}_{-1} 4v\, dv = 2z^2 - 2$
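These closed forms can be sanity-checked numerically. A minimal Monte Carlo sketch (the function names `g_pd`, `g_mp`, `g_al` are illustrative): it averages over the marginal of $X_2$ for the PD profile, uses the point mass $x_2 = z$ for the conditional profile, and integrates the local derivative $4v$ on a grid for the AL profile.

```python
import numpy as np

rng = np.random.default_rng(0)
x2_marginal = rng.uniform(-1, 1, 200_000)  # marginal sample of X2 ~ U[-1, 1]

def f(x1, x2):
    return (x1 + x2) ** 2

def g_pd(z):
    # PD ignores the dependency: average f(z, X2) over the *marginal* of X2
    return f(z, x2_marginal).mean()

def g_mp(z):
    # Conditional (marginal) profile: x1 = x2, so X2 | X1 = z is the point mass at z
    return f(z, z)

def g_al(z, steps=2000):
    # Accumulate E[df/dx1 | X1 = v] = E[2x1 + 2x2 | x1 = v] = 4v from -1 to z
    v = np.linspace(-1.0, z, steps)
    y = 4.0 * v
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(v)))  # trapezoid rule

for z in (-1.0, 0.0, 0.5, 1.0):
    print(f"z={z:+.1f}  PD~{g_pd(z):.3f} (exact {z**2 + 1/3:.3f})  "
          f"MP={g_mp(z):.3f} (exact {4*z**2:.3f})  "
          f"AL~{g_al(z):.3f} (exact {2*z**2 - 2:.3f})")
```

With 200k samples the Monte Carlo PD estimate matches $z^2 + \frac{1}{3}$ to about two decimal places, and the trapezoid rule is exact here since the integrand is linear.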
We will be investigating the same dataset as in HW1: spambase.csv from the OpenML-100 suite. This dataset concerns emails, of which some (~39%) were classified as spam, whereas the rest were work and personal emails. We will be training a Random Forest classifier and an XGBoost classifier.
Below are 2 random entries of the dataframe and the predictions of the different models.
| Observation | True label | RF prediction | XGBoost prediction |
|---|---|---|---|
| Observation 4 | 1 | 0.9524 | 0.986 |
| Observation 5 | 1 | 0.2884 | 0.635 |
As we can see, RF predicted the true label for the first observation but not for the second (its predicted probability of 0.2884 falls below the 0.5 cutoff). XGBoost fared slightly better and predicted both observations correctly.
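For reference, the label assignment above is just the usual 0.5 cutoff applied to the models' predicted probabilities (values copied from the table):

```python
import numpy as np

rf_probs = np.array([0.9524, 0.2884])   # RF predictions for observations 4 and 5
xgb_probs = np.array([0.986, 0.635])    # XGBoost predictions
true_labels = np.array([1, 1])

rf_labels = (rf_probs >= 0.5).astype(int)    # second observation misclassified
xgb_labels = (xgb_probs >= 0.5).astype(int)  # both correct
print(rf_labels, xgb_labels)
```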
Below is a chart showing the CP profiles of the 2 observations, along 2 different variables:
Comment:
Below is a chart with the PDP of the same 2 variables, with individual CP profiles overlaid:
Comment:
Below is a chart showing the PDP of our Random Forest vs XGBoost:
Comment:
import numpy as np
import pandas as pd
import dalex as dx
import lime
spambase = pd.read_csv("spambase.csv")
df = spambase.drop(spambase.columns[0], axis=1) #Cleaning first column which is just index
df.describe()
| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_table | word_freq_conference | char_freq_%3B | char_freq_%28 | char_freq_%5B | char_freq_%21 | char_freq_%24 | char_freq_%23 | capital_run_length_average | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | ... | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.005444 | 0.031869 | 0.038575 | 0.139030 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.076274 | 0.285735 | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 0.488698 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.588000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.065000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.276000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.420000 | 0.000000 | 0.380000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.160000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.188000 | 0.000000 | 0.315000 | 0.052000 | 0.000000 | 3.706000 | 1.000000 |
| max | 4.540000 | 14.280000 | 5.100000 | 42.810000 | 10.000000 | 5.880000 | 7.270000 | 11.110000 | 5.260000 | 18.180000 | ... | 2.170000 | 10.000000 | 4.385000 | 9.752000 | 4.081000 | 32.478000 | 6.003000 | 19.829000 | 1102.500000 | 1.000000 |
8 rows × 56 columns
X = df.loc[:, df.columns != 'TARGET']
y = df['TARGET'] # Select the target as a 1-D Series to avoid sklearn's column-vector warning
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)
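Since the classes are imbalanced (~39% spam), passing `stratify` to `train_test_split` keeps the spam fraction roughly equal in the train and test sets. A sketch on a synthetic stand-in with the same class balance (the real `X`, `y` come from spambase.csv):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in mirroring spambase's size and ~39% positive class
X_demo, y_demo = make_classification(n_samples=4601, weights=[0.61], random_state=2)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2, stratify=y_demo
)
print(ytr.mean(), yte.mean())  # positive-class fractions stay close to each other
```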
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits = 5)
from sklearn.ensemble import RandomForestClassifier
RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X, y)
print("Train accuracy: ", accuracy_score(y, RF_final.predict(X)))
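Accuracy measured on the same data the forest was fitted on is optimistic; a k-fold estimate (like the 5-fold splitter set up above) gives a less biased number. A sketch on a synthetic stand-in, since the real `X`, `y` come from spambase.csv:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=1000, random_state=1)

# Same hyperparameters as RF_final above
rf = RandomForestClassifier(n_estimators=200, max_depth=8, max_features=0.3, random_state=1)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

scores = cross_val_score(rf, X_demo, y_demo, cv=kf, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```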
Train accuracy: 0.9565311888719844
RFexplainer = dx.Explainer(RF_final, X, y)
cp = RFexplainer.predict_profile(new_observation=X.iloc[4:6])
Preparation of a new explainer is initiated
  -> data              : 4601 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 4601 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000001E17A372940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0118, mean = 0.394, max = 0.992
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.902, mean = 7.09e-05, max = 0.947
  -> model_info        : package sklearn
A new explainer has been created!
Calculating ceteris paribus: 100%|██████████| 55/55 [00:01<00:00, 51.93it/s]
y.iloc[4:6]
| | TARGET |
|---|---|
| 4 | 1 |
| 5 | 1 |
observations = X.iloc[4:6]
print(RFexplainer.predict(observations))
[0.95239126 0.28840994]
cp.plot(variables=["char_freq_%24", "capital_run_length_average"])
pdp = RFexplainer.model_profile()
Calculating ceteris paribus: 100%|██████████| 55/55 [00:15<00:00, 3.52it/s]
pdp.plot(variables=["char_freq_%24", "capital_run_length_average"], geom="profiles", title="Partial Dependence Plot with individual profiles")